Temporal dynamics of user activities: deep learning strategies and mathematical modeling for long-term and short-term profiling

Profiling social media users is an analytical approach to generating an extensive blueprint of a user's personal characteristics, which can be useful for a diverse range of applications, such as targeted marketing and personalized recommendations. Although social user profiling has gained substantial attention in recent years, effectively constructing a collaborative model that can describe long-term and short-term profiles is still challenging. In this paper, we discuss the profiling problem from two perspectives: how to mathematically model and track a user's behavior over short and long periods, and how to enhance the classification of a user's activities. Using mathematical equations, our model can identify periods in which the user's interests changed abruptly. A dataset consisting of 30,000 tweets was built and manually annotated into 10 topic categories. Bi-LSTM and GRU models are applied to classify the user's activities representing his interests, which are then used to create and model the dynamic profile. In addition, the effect of word embedding techniques and pre-trained classification models on the accuracy of the classification process is explored in this research.


User's interests and profiles
Many researchers have discussed user profiling (or user classification) on SMNs for various purposes and using different techniques. The authors of 1 presented a Behavior Factorization (BF) model for constructing topic interest profiles for social media users. They analyzed a large quantity of behavior data from users in Google+ and found that the topic interests a user exhibits through one type of behavior differ from those exhibited through other types. To build the profile, BF first learns a latent embedding model by factorizing matrices separated by behaviors, then builds user topic profiles for different types of behaviors using this embedding model. Dougnon et al. 2 designed an algorithm called Partial Graph Profile Inference+ (PGPI+) to infer users' profiles under a partial social graph constraint. The algorithm does not need training, and it lets the user control the balance between the amount of information gathered for profile inference and the resulting inference accuracy. It can also use information such as friendship links, user profiles, and group memberships, as well as the "likes" and "views" from social networks such as Facebook, when available.
On-at et al. 3 proposed a dynamic keyword-based user profile that represents the user's interests through numerical weights. They used the user's egocentric networks as sources to collect the necessary information about his interests and to build his social profile. To make the profile dynamic and reflect the evolution of users' interests, a scoring function with temporal criteria is used to weigh each element extracted from the user's social networks. Farnadi et al. 4 presented a hybrid deep learning user profiling framework based on both users' generated content and their social relational content. It employs a common representation across modalities, facilitating the fusion of data from three distinct sources (visual, textual, and relational) at the feature level. At the decision level, the approach combines the decisions of the different networks that operate on each collection of data sources to obtain better profiling. Chen et al. 5 developed a semi-supervised classification paradigm to predict a user's profile using a heterogeneous graph structure. In their heterogeneous graph attention networks (HGAT) model, the entities of interest (e.g., items, users, and attributes of items) are represented as nodes, while the interactions between entities are the edges. The model learns the representation of each entity by considering the graph structure and then uses the attention mechanism to determine the relevance of each neighbor entity.
For influencer marketing, the authors of 6 introduced a multimodal deep learning model that uses both the text and image data of Instagram users' posts to classify influencers and their individual posts into specific topics and interests (such as family and fitness). To generate better influencer representations, they used the attention mechanism to identify the posts most relevant to the influencers' topics. De Campos et al. 7 represented users by hybridizing two different homogeneous sub-profiles (temporal and topical). To construct the topical sub-profiles, they used LDA (Latent Dirichlet Allocation) to perform a clustering process. The temporal sub-profiles are built by dividing the user's interactions into time intervals and computing the frequency of interactions within each interval. Finally, their approach combines both methods of profile construction, leveraging the topical and temporal aspects simultaneously to obtain sub-profiles that are consistent in both traits. Table 1 shows a comparative analysis of the aforementioned research. It is important to acknowledge that the lack of standardized datasets and benchmarks makes it unfair to compare profiling methods directly. Furthermore, the variations in platforms, user demographics, profiling criteria, techniques, and evaluation methodologies across studies make a comprehensive and accurate comparison challenging. Our efforts have focused on evaluating user profiling methods across multiple tasks or settings to gain insights into the strengths and limitations of different profiling techniques. As a result, the comparison is approximate in terms of evaluation criteria, results, and the strengths and weaknesses of each method.

Text classification
Classifying the user's generated content is an essential step in generating his dynamic profile. Many research papers have discussed text classification problems and proposed different solutions. The authors of 8 tried to enhance the accuracy and effectiveness of text classification by proposing a novel term weighting approach. They adopted an existing TextCNN model 9 and combined the word embeddings with a new term weighting scheme that takes into account the varying importance of terms in documents with different class labels. The scheme assigns multiple weights to every term so that each weight appropriately reflects the term's importance to documents from a different text class. For the multi-label classification task, the authors of 10 presented a sequence-to-sequence (Seq2Seq) learning model, which captures both local and global semantic information in text through its encoder and decoder modules. The encoder combines a CNN and a recurrent neural network (RNN) to extract local semantic features and capture long-range dependencies between features. The decoder, on the other hand, employs an RNN to capture the global label correlation and also initializes a fully connected layer that reflects the correlation between any two different labels.
Xu et al. 11 proposed a solution for data sparsity in a deep learning classification model for short text by utilizing a probabilistic knowledge base to represent words and sentences. Data sparsity refers to the fact that short texts often contain too few words to provide enough information for accurate classification, which degrades classification performance. They combined word embeddings and concept embeddings to enrich the text representation and help the model utilize word-level knowledge instead of sentence-level knowledge. Li et al. 12 suggested a recursive data-pruning solution for the misfitting problem in a CNN model used for text classification. Misfitting means that CNNs may capture irrelevant words in the dataset because of limited training samples and over-parameterization, which can lead to unsatisfactory performance in text classification tasks. Their solution starts after standard training by evaluating all convolutional filters based on the discriminative power of the features they generate in the pooling layer. Subsequently, filters with lower evaluation scores are identified, and the words associated with these poorly performing filters are removed from the training data. This process is iterated to recursively eliminate the words that are irrelevant to the task. Eventually, the cleaned data is used to train the single-convolutional-layer CNN model, which leads to better generalization.
To improve the performance of short text classification, the authors of 13 explored the use of word taxonomies to construct semantic feature vectors that enhance the feature vectors generated by traditional text processing algorithms such as tf-idf. Their tax2vec approach helps in exploring and understanding how external semantic information can be incorporated into current (black box) machine learning algorithms, as well as revealing the nature of the acquired knowledge. Semantic features were also used by the authors of 14 with a modified deep learning model to improve the accuracy of short-text classification. They proposed an approach called CRFA (Context-Relevant Features with multi-stage Attention based on Temporal Convolutional Network (TCN) and CNN), which consists of three layers: embedding, representation, and output. To reduce short-text ambiguity and sparsity, they used an external knowledge base called "Probase" within the embedding layer to enhance the representation at both the word and concept levels. The representation layer is composed of a two-level TCN-based attention model, WTCN (Word-level TCN) and CTCN (Concept-level TCN), to select discriminative concepts and word features for short text classification.

Proposed framework
Our framework has two main axes: classifying the user's activities and constructing his dynamic profile. The following subsections clarify each axis.

User profile with temporal dynamics
A weight-based user profile is a representation in which the user profile is represented by a keyword or a set of keywords that is either provided directly by the system or automatically extracted from web pages or documents. Keywords are associated with numerical weights to represent the user's interests in different topics or categories.
In our previous research 15, we considered a user $u$ inside the social media group $g$, with a static profile $P_u$ and discussing $N$ topics. We used a weight-based user profile to present the dynamic profile of the user, $D_u(t)$, which reflects the position $x_u$ ($m$ dimensions) of the user inside the topic sphere, such that each coordinate of $x_u$ is the distance between the user and the $j$th topic after the $i$th iteration.
Our model is based on the following assumptions about the connection between the user and topics:
1. The topics the user is interested in represent 100% of his mind.
2. The total similarity between the user and each topic depends on the user's static profile $\mathrm{sim}_{c_j}^{u}(t_0)$, the user's activities $A\_\mathrm{sim}_{c_j}^{u}(t)$, and the user's following list $F\_\mathrm{sim}_{c_j}^{u}(t)$.
3. The user's interests found in his static profile are used to calculate the initial similarity between the user and each topic $c_j$.
4. The user's activities, such as posts $P$, shares $S$, or likes $L$, have different significance weights.
5. The similarity between the user and a topic increases as the distance between them decreases.
6. The distance between the user and each topic changes after each activity.
Consider bloggers who use social media to share their daily activities and are not interested in wars or disasters. One day, a catastrophe occurs in their country, so they use their social accounts to express their feelings and to support the victims. Their user profiles should reflect the unusual reaction to the crisis as a short-term interest, and entertainment and other older interests as long-term ones.
In this paper, we will introduce how to use our model to accommodate the short-term and long-term profiles.

Definition 1 (Temporal user profile)
The temporal profile $D_u(\mathit{time})$ of user $u$ is the position $x_u$ of the user inside the topic sphere based on specific timespans:

$$D_u(\mathit{time}) = x_u = \left( d_{c_1}^{u}(\mathit{time}),\; d_{c_2}^{u}(\mathit{time}),\; \ldots,\; d_{c_N}^{u}(\mathit{time}) \right)$$

where $d_{c_j}^{u}(\mathit{time})$ is the distance between the user and the $j$th topic category at the end of a given period. For the long-term profile, the timespan runs from the creation of the profile to the current moment. Accordingly, the initial values are determined from the user's static profile, as stated in assumption 3. For the short-term profile, on the other hand, the timespan begins at the start of the specified period. Hence, the start values of $d_{c_j}^{u}$ are taken from the user's dynamic profile at the beginning of the timespan. Using the temporal profile, we can explore how the user profile evolves over time; for example, we could investigate whether the user's profile generated on weekends differs from his profile on weekdays.
In order to measure the difference between two profiles $D_u(a)$ and $D_u(b)$ built over timespans $a$ and $b$, we apply the Manhattan distance (also known as the L1 distance) in vector representation:

$$L_1 = \sum_{j=1}^{N} \left| d_{c_j}^{u}(a) - d_{c_j}^{u}(b) \right|$$

The higher the $L_1$ value, the larger the disparity between the two profiles, and vice versa. The Manhattan distance provides an overall measure of similarity or dissimilarity between the two profiles. Because it calculates the distance between two points by summing the absolute differences in their coordinates, it is more robust to outliers and variations in individual dimensions (i.e., it does not specify which interests contribute more or less to the overall distance). To analyze the user's behavior and detect any unexpected change in it, we calculate the squared difference $\left( d_{c_j}^{u}(a) - d_{c_j}^{u}(b) \right)^2$ for each topic category $c_j$ to obtain more detailed information about the differences between corresponding distances in the two profiles.
The squared difference is the squared value of the difference between the corresponding coordinates of two points in a multidimensional space. It is useful when assessing the magnitude of change within specific categories, as it amplifies differences between values. The squared difference may be sensitive to outliers and can overemphasize large differences, so it is typically used at the category level rather than for overall profile changes. By setting specific thresholds or criteria, we can define significant differences in user behavior or discover unusual changes in user interests. For example, we might consider elements with squared differences above a certain threshold to reflect a significant change. Criteria such as when a user first becomes interested in a topic and how long that interest lasts could indicate whether a change is temporary or lasting.
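The two comparison measures described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: profiles are assumed to be plain lists of user-to-topic distances in a fixed topic order, and the function names, the toy values, and the threshold of 0.1 are hypothetical.

```python
from typing import List


def manhattan_distance(profile_a: List[float], profile_b: List[float]) -> float:
    """Overall disparity between two profiles: sum of absolute coordinate differences."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same topics")
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b))


def squared_differences(profile_a: List[float], profile_b: List[float]) -> List[float]:
    """Per-topic squared differences, which amplify large changes in single categories."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same topics")
    return [(a - b) ** 2 for a, b in zip(profile_a, profile_b)]


def flag_changed_topics(
    topics: List[str],
    profile_a: List[float],
    profile_b: List[float],
    threshold: float,
) -> List[str]:
    """Return the topics whose squared difference exceeds the (hypothetical) threshold."""
    diffs = squared_differences(profile_a, profile_b)
    return [topic for topic, diff in zip(topics, diffs) if diff > threshold]


# Toy distances (illustrative values only): a smaller distance means a stronger interest.
topics = ["business", "crisis", "entertainment"]
long_term = [0.20, 0.90, 0.30]
short_term = [0.25, 0.30, 0.35]

overall = manhattan_distance(long_term, short_term)
changed = flag_changed_topics(topics, long_term, short_term, threshold=0.1)
```

In this toy case only the "crisis" coordinate moves substantially, so it is the only topic flagged, while the Manhattan distance summarizes the total shift without saying which topic caused it.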

Text-topic classification
Classifying the activities of a user is a key task in creating his dynamic profile. Since deep learning models have consistently proven their effectiveness in resolving numerous text classification challenges, we used them to classify text into specific topics. Figure 1 shows an overview of the proposed models.

Data collection and preprocessing
We applied the models to two sets of tweets. The first is the tweet dataset collected by 16, which consists of 22,424 manually labeled tweets divided into 11 topic categories: (C1) business/finance, (C2) crisis [disaster/war], (C3) entertainment, (C4) politics, (C5) health/medical, (C6) law/crime, (C7) weather, (C8) life/society, (C9) sports, (C10) technology/internet, and (C11) others, distributed as shown in Table 2. We observed that the dataset is imbalanced, as there is a substantial disparity in the number of tweets between different classes, which could affect the performance of classifiers.
To handle this problem, we modified the dataset so that each class contains 3500 tweets. For classes with fewer than 3500 tweets, we collected relevant tweets using the Twitter API to reach the specified number; classes with more than 3500 tweets were reduced by randomly removing the surplus tweets. The final dataset consists of 35,000 tweets distributed equally among 10 categories after eliminating the 'others' class (C11).
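The downsampling half of this balancing step can be sketched as follows. This is an illustrative sketch only: the function name `balance_by_downsampling`, the `(text, label)` sample layout, and the toy data are assumptions, and the collection of extra tweets for under-represented classes is not modeled; the function merely reports how many tweets each such class still lacks.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, str]  # (tweet text, topic label)


def balance_by_downsampling(
    samples: List[Sample],
    target: int,
    drop_labels: Iterable[str] = (),
    seed: int = 0,
) -> Tuple[List[Sample], Dict[str, int]]:
    """Cap every class at `target` samples and report the shortfall of smaller classes."""
    rng = random.Random(seed)
    dropped = set(drop_labels)
    by_label: Dict[str, List[str]] = defaultdict(list)
    for text, label in samples:
        if label not in dropped:
            by_label[label].append(text)

    balanced: List[Sample] = []
    shortfall: Dict[str, int] = {}
    for label in sorted(by_label):
        texts = by_label[label]
        if len(texts) > target:
            # Randomly keep `target` tweets from an over-represented class.
            texts = rng.sample(texts, target)
        elif len(texts) < target:
            # Under-represented class: more tweets would have to be collected.
            shortfall[label] = target - len(texts)
        balanced.extend((text, label) for text in texts)
    return balanced, shortfall


# Toy data with a target of 3 per class (the paper uses 3500).
toy = [(f"sports tweet {i}", "sports") for i in range(5)]
toy += [(f"weather tweet {i}", "weather") for i in range(2)]
toy += [("misc tweet", "others")]

balanced, shortfall = balance_by_downsampling(toy, target=3, drop_labels=("others",))
counts = Counter(label for _, label in balanced)
```

Fixing the random seed keeps the sampled subset reproducible across runs, which is useful when the same balanced dataset must be reused in several experiments.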
Preprocessing steps are applied to ensure that the tweets are clean and suitable for the classification process. We lowercase all tweets to eliminate case-related variations. Special characters (except $ and %), punctuation, URLs, mentions, and hashtags are removed. After that, we tokenize the tweets using the tokenizer in the NLTK package.
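A minimal sketch of these cleaning steps is given below, using only the Python standard library. Several details are assumptions rather than the authors' choices: hashtags are dropped together with their text, only ASCII letters and digits are kept, and whitespace splitting stands in for the NLTK tokenizer used in the paper.

```python
import re
from typing import List

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
# Everything except lowercase letters, digits, whitespace, '$' and '%' is removed.
DISALLOWED_RE = re.compile(r"[^a-z0-9\s$%]")


def clean_tweet(text: str) -> List[str]:
    """Lowercase, strip URLs/mentions/hashtags/punctuation, then split into tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)
    text = DISALLOWED_RE.sub(" ", text)
    # Stand-in for the NLTK tokenizer: plain whitespace tokenization.
    return text.split()


tokens = clean_tweet("Stocks UP 5% today!! $AAPL @trader #markets https://t.co/xyz")
```

URLs are removed before punctuation so that their fragments do not survive as stray tokens; the `$` and `%` characters are kept because they carry meaning in finance-related tweets.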

Word embedding
After tokenization, the tweet's text is represented as vectors (numerical values) using an embedding model. Word embeddings are a type of distributed representation in an n-dimensional space designed to capture the semantic meanings of words. We used two pre-trained distributed word embedding models, GloVe 17 and FastText 18, to capture the semantic meaning of words in a sequence of text. GloVe focuses on capturing global co-occurrence statistics of words in large text corpora, aiming to represent words based on their contextual relationships. In our model, we used 300-dimensional GloVe embeddings trained on a large corpus. FastText is an algorithm developed by Facebook that treats each word as a combination of character n-grams, allowing it to represent out-of-vocabulary words and morphological variations effectively. FastText offers more flexibility and robustness in handling a wide range of languages and text types. We used FastText and GloVe separately and compared the results to study which one leads to higher classification accuracy.
1. Recurrent Neural Networks (RNNs): These are a type of neural network designed for processing sequential data. They have a unique ability to maintain an internal memory, or hidden state, that allows them to capture dependencies over time. However, traditional RNNs suffer from vanishing gradient problems during training, making it challenging to capture long-term dependencies effectively. To solve these issues, several modifications and variants of RNNs have been developed. Long Short-Term Memory (LSTM) networks 19 introduce sophisticated gating mechanisms to control the flow of information, enabling them to capture long-range dependencies. Bidirectional LSTM (Bi-LSTM) 20 processes data in both forward and backward directions, enhancing context understanding. The Gated Recurrent Unit (GRU) 21 is another RNN variant that is known for its efficiency and simplicity. GRUs are effective at capturing sequential patterns and have been widely employed in natural language processing tasks, text classification, and time series prediction, offering a balance between computational efficiency and modeling capability.
2. BERT Model: BERT 22 is a transformer-based model that can be fine-tuned to solve a wide range of real-world NLP tasks. Fine-tuning BERT to classify text typically involves feeding labeled data to BERT and updating its parameters through backpropagation. This process allows BERT to leverage its pre-trained knowledge of language and semantics to excel in the classification task, often achieving state-of-the-art results with relatively little training data. In our experiments, we used a compact version of BERT called DistilBERT 23 that is designed to be smaller and faster while maintaining much of BERT's language understanding capabilities. It achieves this by employing knowledge distillation during training, where it learns from a larger pre-trained BERT model. The key distinctions lie in the reduced size and greater efficiency of DistilBERT, making it more suitable for applications with limited computational resources or a need for faster inference.
The first layer of the DistilBERT model involves the initial preprocessing and transformation of raw tweet text into a structured format that can be fed into the DistilBERT model for further processing and classification. It encompasses tokenization, padding, truncation, the addition of special tokens to create input tensors, and the creation of attention masks. DistilBERT takes the tokenized tweet text as input and generates contextualized embeddings for each token in the text. These embeddings capture semantic and contextual information.
The model variant used for classification is "DistilBERT-base-uncased." This variant is based on the DistilBERT architecture and is case-insensitive (the input text is lowercased). It is a smaller and more efficient version of the original BERT model. This DistilBERT variant consists of 6 layers of transformer encoder blocks, 768 hidden dimensions, and 12 attention heads in each multi-head self-attention mechanism. Its vocabulary size is about 30,000 (30,522 for the uncased variant), which means that the model can tokenize and work with a vocabulary of roughly 30,000 unique sub-word pieces.
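The input preparation described above (special tokens, truncation, padding, and attention masks) can be illustrated without any deep learning library. The sketch below is an assumption-laden illustration, not the tokenizer used in the paper: the function name `build_inputs` is hypothetical, the token ids in the example are arbitrary, and the special-token ids 101, 102, and 0 are assumed to match the [CLS], [SEP], and [PAD] entries of the uncased BERT vocabulary.

```python
from typing import List, Tuple

# Assumed special-token ids of the uncased BERT vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def build_inputs(token_ids: List[int], max_len: int) -> Tuple[List[int], List[int]]:
    """Add [CLS]/[SEP], truncate or pad to `max_len`, and build the attention mask."""
    if max_len < 2:
        raise ValueError("max_len must leave room for [CLS] and [SEP]")
    # Truncation: keep at most max_len - 2 sub-word ids.
    body = token_ids[: max_len - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get mask 1; padding positions get mask 0.
    attention_mask = [1] * len(input_ids)
    padding = max_len - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding
    return input_ids, attention_mask


# Hypothetical sub-word ids for a short tweet and for an over-long tweet.
short_ids, short_mask = build_inputs([7592, 2088], max_len=6)
long_ids, long_mask = build_inputs([11, 12, 13, 14, 15, 16], max_len=5)
```

The attention mask lets the model ignore padding positions, so tweets of different lengths can be batched into tensors of a single fixed shape.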

Evaluation metrics
The performance metrics used to evaluate our models are accuracy, precision, recall, and F1-score. Accuracy measures the overall correctness of the model's predictions by calculating the ratio of correctly classified instances to the total number of instances.
Precision evaluates the model's ability to make accurate positive predictions within each class, indicating the fraction of correctly predicted positive instances among all instances predicted as positive.
Recall, on the other hand, gauges the model's ability to capture all positive instances within each class, measuring the fraction of correctly predicted positive instances among all actual positive instances.
The F1-score is a balanced measure that combines precision and recall, providing a single value that reflects the model's overall performance across all classes.
Weighted average (WA) and macro average (MA) are two approaches for aggregating precision, recall, and F1-score metrics. The weighted average takes class imbalance into account by assigning weights based on class proportions, giving more importance to the majority classes. This is useful when optimizing the model's performance with respect to the class distribution. In contrast, the macro average treats all classes equally, providing an unbiased assessment of the model's ability to perform across all classes, regardless of size or imbalance.
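The following sketch shows how the per-class metrics and the two averaging schemes relate. It is a plain-Python illustration with hypothetical function names and toy labels; it assumes that a metric with a zero denominator is reported as 0.0, which is a convention chosen here and not one stated in the paper.

```python
from typing import Dict, List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Ratio of correctly classified instances to all instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Precision, recall, F1 and support for every label seen in the data."""
    metrics: Dict[str, Dict[str, float]] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": float(tp + fn),  # number of true instances of this class
        }
    return metrics


def macro_average(metrics: Dict[str, Dict[str, float]], key: str) -> float:
    """Unweighted mean over classes: every class counts equally."""
    return sum(m[key] for m in metrics.values()) / len(metrics)


def weighted_average(metrics: Dict[str, Dict[str, float]], key: str) -> float:
    """Mean over classes weighted by class support."""
    total = sum(m["support"] for m in metrics.values())
    return sum(m[key] * m["support"] for m in metrics.values()) / total


y_true = ["sports", "sports", "sports", "health"]
y_pred = ["sports", "sports", "health", "health"]
scores = per_class_metrics(y_true, y_pred)
macro_f1 = macro_average(scores, "f1")
weighted_f1 = weighted_average(scores, "f1")
```

On a perfectly balanced test set the two averages coincide, because every class then has the same support.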

Text classification experiments
This section presents and discusses the experiments with the text-topic classification models. Our experiments are divided into three main dimensions: studying the effect of the imbalanced dataset on classification accuracy, studying the effect of feature extraction techniques, and studying the effect of using pre-trained models in the classification task. The datasets in all experiments are divided into two parts: 80% as the training set and 20% as the test set.
Experiment 1: In the first experiment, the Bi-LSTM and GRU models are applied to both the old and new datasets. Table 3 shows the significant change in performance across all metrics between the old and new datasets, showcasing the effectiveness of the updated dataset. This improvement in performance between the old and new datasets suggests that the models have learned patterns that generalize better to unseen data.

Experiment 2:
The second experiment is conducted to study the effect of different pre-trained word embeddings on classification accuracy using our new dataset. GloVe and FastText are used to construct the embedding matrix. This matrix serves as the initial weights for the embedding layer of our model. We chose a 300-dimensional vector to represent each word in the vocabulary, which is passed to the next layer (Bi-LSTM or GRU). The models were trained for 50 epochs using the hyperparameters shown in Table 4.
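The construction of an embedding matrix from pre-trained vectors can be sketched as below. This is a simplified, library-free illustration under several assumptions: `word_index` maps each vocabulary word to a 1-based row index (row 0 is reserved for padding), out-of-vocabulary words receive small random vectors, and the function name and toy 3-dimensional vectors are hypothetical. The paper uses 300-dimensional GloVe or FastText vectors.

```python
import random
from typing import Dict, List


def build_embedding_matrix(
    word_index: Dict[str, int],
    pretrained: Dict[str, List[float]],
    dim: int,
    seed: int = 0,
) -> List[List[float]]:
    """Return one row per vocabulary word; row 0 stays all zeros for padding."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, row in word_index.items():
        vector = pretrained.get(word)
        if vector is None:
            # Assumption: words missing from the pre-trained table get small random values.
            vector = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        if len(vector) != dim:
            raise ValueError(f"vector for {word!r} has dimension {len(vector)}, expected {dim}")
        matrix[row] = list(vector)
    return matrix


# Toy vocabulary and toy 3-dimensional "pre-trained" vectors.
word_index = {"stocks": 1, "storm": 2, "covfefe": 3}
pretrained = {"stocks": [0.1, 0.2, 0.3], "storm": [0.4, 0.5, 0.6]}
embedding_matrix = build_embedding_matrix(word_index, pretrained, dim=3)
```

A matrix built this way would typically be loaded as the initial weights of the embedding layer, so that each token index selects its pre-trained vector before the sequence reaches the Bi-LSTM or GRU layer.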
Table 5 presents the results achieved by the two models with pre-trained FastText and GloVe word embeddings. From the table, we can see that (1) the Bi-LSTM model with FastText gives the best results, (2) the Bi-LSTM model achieves better results than GRU, and (3) FastText embeddings helped the models achieve better accuracy.

Experiment 3:
The final experiment is also conducted on our new dataset to compare the performance of DistilBERT, fine-tuned as a classifier, with the performance of the RNN models. The key configurations of our [...] Table 6, which is better than the previous RNN models, as shown in Fig. 2.
For further analysis of the best model, Fig. 3 shows the confusion matrix, which presents the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for each class. We can notice that the "Business-Finance" class has many tweets that are classified as "Technology-Internet" and vice versa, which [...] The results show a difference between the users' short-term profiles, especially in the second topic, which has a higher squared difference.

Figure 1. The architecture of the proposed topic-classification models.
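$$\text{Precision} = \frac{\text{Number of correct predictions of the topic (TP)}}{\text{Total number of instances predicted as that topic (TP + FP)}} \tag{5}$$

$$\text{Recall} = \frac{\text{Number of correct predictions of the topic (TP)}}{\text{Total number of instances actually in that topic (TP + FN)}} \tag{6}$$

$$\text{F1-Score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$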

Figure 2. Comparison between all classification models.

Figure 3. The confusion matrix of the DistilBERT model.

Table 9. User's short- and long-term profiles and distribution of the user's activities (A_T: activity type, T: number of tweets, R: number of retweets, and L: number of likes).

Table 1. A rough comparison between profiling research.

Table 3. Comparison between Bi-LSTM and GRU models on the old and new datasets.

Table 4. Hyperparameters used in Bi-LSTM and GRU models.

Table 6. The performance metrics of DistilBERT.